Add IPC stream interface for zero-copy Arrow data access #278
Closed

jadewang-db wants to merge 3 commits into databricks:main from
Conversation
2ebf522 to 12801b3
```go
	initialRowSet *cli_service.TRowSet,
	schemaBytes []byte,
	cfg *config.Config,
) (dbsqlrows.IPCStreamIterator, error) {
```
Collaborator
Do we have a scenario where we could return an error here?
```go
if fetchResult == nil || fetchResult.Results == nil || fetchResult.Results.ArrowBatches == nil {
	return nil, io.EOF
}
```
Collaborator
This assumes that fetchResult will always have Arrow batches, but we could also have cloud fetch links. We could use BatchIterator to abstract those details for us: https://github.com/databricks/databricks-sql-go/blob/main/internal/rows/arrowbased/arrowRecordIterator.go#L141-L162
```go
if r.resultSetMetadata != nil && r.resultSetMetadata.ArrowSchema != nil {
	schemaBytes = r.resultSetMetadata.ArrowSchema
} else {
	// Fall back to generating from table schema
```
Collaborator
We already have tTableSchemaToArrowSchema in arrowRows.
Description
This PR introduces a new IPCStreamIterator interface that provides zero-copy access to Arrow data through IPC (Inter-Process Communication) streams. This enhancement allows downstream consumers to efficiently access Arrow data without incurring serialization/deserialization overhead.

Problem Statement
Currently, the databricks-sql-go driver returns Arrow data through the GetArrowBatches() method, which provides deserialized Arrow v12 records. When consumers use a different Arrow version (e.g., Apache Arrow ADBC uses v18), this requires expensive conversion between versions.

Solution
This PR adds a new optional interface that exposes raw Arrow IPC streams:
Key Benefits
Implementation Details
New Files
- rows/ipc_stream.go - Public interface definitions
- internal/rows/arrowbased/ipc_stream_iterator.go - Implementation

Modified Files
- internal/rows/rows.go - Added GetIPCStreams() method

Key Features
Usage Example
Performance Benchmark
Tested with 100K rows:
Testing
Breaking Changes
None. This is a purely additive change:
- GetArrowBatches() method unchanged

Future Considerations
Related Context
This enhancement was driven by the Apache Arrow ADBC integration, where we identified significant performance overhead when converting between Arrow versions. However, this improvement benefits any consumer that:
Checklist
Questions for Reviewers